In: Eisner, J., L. Karttunen and A. Theriault (eds.), Finite-State Phonology: Proc. of the 5th
Workshop of the ACL Special Interest Group in Computational Phonology (SIGPHON), pp. 46-56,
Luxembourg, Aug. 2000.

Multi-Syllable Phonotactic Modelling 



Anja Belz 

CCSRC, SRI International 
23 Millers Yard, Mill Lane 
Cambridge CB2 1RQ, UK
anjab@cam.sri.com



Abstract 

This paper describes a novel approach to construct- 
ing phonotactic models. The underlying theoretical 
approach to phonological description is the multi- 
syllable approach in which multiple syllable classes 
are defined that reflect phonotactically idiosyncratic 
syllable subcategories. A new finite-state formalism, 
OFS Modelling, is used as a tool for encoding,
automatically constructing and generalising phonotactic
descriptions. Language-independent prototype
models are constructed which are instantiated on the 
basis of data sets of phonological strings, and gener- 
alised with a clustering algorithm. The resulting ap- 
proach enables the automatic construction of phono- 
tactic models that encode arbitrarily close approxi- 
mations of a language's set of attested phonological 
forms. The approach is applied to the construction 
of multi-syllable word-level phonotactic models for 
German, English and Dutch. 

1 Introduction 

Finite-state models of phonotactics have been
used in automatic language identification (Kissman,
1995; Belz, 2000), in speech recognition
(Carson-Berndsen, 1992; Jusek et al., 1994; Jusek et al.,
1996; Carson-Berndsen, 2000), and optical character
recognition, among other applications. While
statistical models (n-gram or Markov models) are derived
automatically from data, their symbolic equivalents
are usually constructed in a painstaking manual
process and, because they are based on standard
single-syllable phonological analyses, tend to
overgeneralise greatly over a language's set of wellformed
phonological strings. This paper describes methods
that enable the automatic construction of symbolic
phonotactic models that are more accurate
representations of phonological grammars.

The underlying theoretical approach to phonological
description is the Multi-Syllable Approach (Belz,
1998; Belz, 2000). Syllable phonotactics vary
considerably not only in correlation with a syllable's
position within a word, but also with other factors such
as position relative to word stress. Analyses based
on multiple syllable classes defined to reflect such
factors can more accurately account for the
phonologies of natural languages than analyses based on a
single syllable class.

Object-Based Finite State Modelling (previously
described in Belz, 2000) is used as an encoding,
construction and generalisation tool, and facilitates
Language-Independent Prototyping, where
incompletely specified generic models are constructed
for groups of languages and subsequently
instantiated and generalised automatically to fully
specified, language-specific models using data sets of
phoneme strings from individual languages. The
theory-driven (manual) component in this
construction method is restricted to specifying the
maximum possible ways in which syllable phonotactics
may differ in a family of languages, without
hard-wiring the differences into the final models. The
actual construction of models for individual languages
is a data-driven process and is done automatically.

Sets of German, English and Dutch syllables were
used extensively in the research described in this
paper, both as a source of evidence in support of
the multi-syllable approach (Section 2) and as data
in automatic phonotactic model construction
(Section 4). All syllable sets were derived from sets of
fully syllabified, phonetically transcribed forms
collected from the lexical database CELEX (Baayen et
al., 1995). CELEX contains compounds and phrases
as well as single words. Phonological words were
defined as any phonetic sequence with a single primary
stress marker, and all other entries were disregarded.

2 Multi-Syllable Phonotactics 

The multi-syllable approach works on the assump- 
tion that single-syllable approaches cannot ade- 
quately capture the phonological grammars of nat- 
ural languages, because they fail to account for the 
significant syllable-based phonotactic variation re- 
sulting from a range of factors that is evident in 
natural languages, and consequently overgeneralise 
greatly. 

               German                  English                 Dutch
                  all  unique (%)         all  unique (%)         all  unique (%)
Initial         3,806    624 (16.4%)    6,177  2,657 (43.01%)   5,476    947 (17.29%)
Medial          3,832    358 (9.34%)    3,149    344 (10.92%)   5,446    723 (13.28%)
Final           7,040  2,133 (30.3%)    6,750  2,132 (31.59%)   7,279  1,786 (24.54%)
Monosyllables   5,114    855 (16.72%)   7,265  2,963 (40.78%)   5,641    718 (12.73%)
TOTAL          10,606  3,970 (37.43%)  14,333  8,096 (56.49%)  11,448  4,174 (36.46%)

Table 1: Syllable set sizes and number of syllables unique to each set (position).

Single-syllable analyses. The traditional view
is that all syllables in a language share the same
structure and compositional constraints which can
be captured by a single analysis. In many languages,
however, the sets of word-initial and/or word-final 
consonant clusters differ significantly from other 
consonantal clusters (Goldsmith, 1990, p. 107ff, lists
several examples from different languages). Such
idiosyncratic clusters have been treated as
'terminations', 'appendices', or as 'extrasyllabic'
(Goldsmith, 1990), and integrated along with syllables at the
word-level. Similar, apparently irregular phenom- 
ena occur in correlation with tone and stress, and 
the first and last vocalic segments in phonological 
words are often analysed as 'extratonal' and 'extra- 
metrical'. However, such apparent irregularities are 
not restricted to the beginnings and ends of phono- 
logical words, and the phonotactics of syllables are 
affected by a range of factors other than position, 
which are difficult if not impossible to account for 
by the notion of extrasyllabicity. 

Three problematic issues arise in single-syllable 
analyses. Firstly, if a phonotactic model assumes 
a single syllable class for a language, and if the 
language has idiosyncratic word-initial and word-final
phonotactics, then the set of possible phonological
words that the model encodes is necessarily too large,
and includes words that form systematic (rather than
accidental) gaps in the language. Secondly, if
extrasyllabicity is used to account for phonotactic
idiosyncrasies, then the resulting theory of syllable
structure fails to account for everything that it is
intended to account for, and is forced to integrate
constituents that are not syllables (the extrasyllabic
material) at the word level.
Thirdly, the notion of extrasyllabicity only works 
for cases where phonemic material can be segmented 
off adjacent syllables (most easily done at the begin- 
nings and ends of words), and cannot be used to 
account for syllable-internal variation. The alterna- 
tive offered by multi-syllable analyses is to make the 
universal assumption that position, stress and tone 
(among other factors) will result in variation in
syllable phonotactics that is not necessarily restricted
to any particular part of words, and to account for 
such variation systematically by the use of different 
syllable classes. 

Related approaches. The idea of discriminating
between different syllable types, classified by word
position and position with respect to the stressed
syllable, has been explored and utilised in previous
research, for example in FSA-based phonotactic
models, typed formalisms, and in stochastic production
rule grammars. Carson-Berndsen (1992) uses two
separate FSAs to encode the phonotactics of full and
reduced syllables, and Jusek et al. (1994)
distinguish between stressed and unstressed syllables. In
a typed feature system of morpho-phonology,
Mastroianni and Carpenter (1994) define subtypes of the
general type syllable.

The most closely related existing research is that
presented by Coleman and Pierrehumbert (1997).
The paper examines different possibilities for using 
a probabilistic grammar for English words to model 
native speakers' acceptability judgments. The pro- 
duction rule grammar encodes the phonotactics of 
English monosyllabic and bisyllabic words. Differ- 
ent probability distributions over paths in derivation 
trees are investigated which model likelihood of ac- 
ceptability to native speakers, rather than likelihood 
of occurrence. To build a grammar that accounts 
for interactions among onsets and rhymes, location 
with respect to the word edge and word stress pat- 
terns, six syllable types are distinguished which re- 
flect possible combinations of the features strong, 
weak, initial and final. The subsyllabic constituents 
onset and rhyme are similarly marked for stress and 
position. 

The present research extends existing work on
syllable subclasses by applying the multi-syllable
approach systematically to model the entire
phonotactics of languages, and by using it for
language-independent prototyping (see Section 3.3 below).

Position-correlated phonotactic variation.
Table 1 shows statistics for sets of monosyllabic
words and initial, medial and final syllables in
CELEX. For each language and each syllable set, the
table shows the size of the set (e.g. there are 3,806
different initial German syllables in CELEX), and
the size of its subset of syllables that do not occur in
any other set (e.g. 624 out of 3,806 initial German
syllables, or 16.4%, only occur word-initially). For
all three languages, the figures show significant
differences between the sets of syllables that can
occur in the four different positions and their unique

                   Medial          Final           Mono
German:   Initial  2,619 (0.52)    1,466 (0.16)    1,392 (0.18)
          Medial                   1,928 (0.22)    1,185 (0.15)
          Final                                    3,873 (0.47)
English:  Initial  1,860 (0.25)    1,920 (0.17)    2,266 (0.20)
          Medial                   1,787 (0.22)    1,008 (0.11)
          Final                                    3,576 (0.34)
Dutch:    Initial  3,594 (0.49)    2,764 (0.28)    3,003 (0.37)
          Medial                   3,279 (0.35)    2,428 (0.28)
          Final                                    4,320 (0.50)

Table 2: Intersections and set similarities for German, English and Dutch syllables (position).





               German                  English                 Dutch
                  all  unique (%)         all  unique (%)         all  unique (%)
Stressed        8,919  2,977 (33.37%)   9,399  5,280 (56.18%)   9,934  3,484 (35.07%)
Pretonic          989     30 (3.03%)    3,201  1,362 (42.55%)   1,780     71 (3.99%)
Posttonic       5,897    388 (6.58%)    4,754    670 (14.09%)   5,960    517 (8.67%)
Plain           6,819    229 (3.36%)    6,020    944 (15.68%)   6,662    176 (2.64%)
TOTAL          10,598  3,624 (34.20%)  14,333  8,256 (57.60%)  11,443  4,248 (37.12%)

Table 3: Syllable set sizes and number of syllables unique to each set (stress).



subsets. In German and Dutch, final syllables are
particularly idiosyncratic, with 30.3% and 24.54%,
respectively, not occurring in any other position. In
English, all syllable sets except the medial syllables
display a high degree of idiosyncrasy. Table 2
shows the size of the intersections between the
syllable sets, and the more objective measure of
set similarity in brackets¹. In German and Dutch,
the similarity between initial and medial syllables,
and between final and monosyllables is particularly
high. The similarity between the least similar of
syllable sets is much greater in Dutch than in either
English or German. In English, only the final and
monosyllables display any significant similarity.
Average set similarity is highest in Dutch (0.37),
followed by German (0.28), and English (0.21).
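
The similarity values in Table 2 follow from the set sizes in Table 1 and the intersection sizes in Table 2, because the size of a union equals the sum of the two set sizes minus the size of the intersection. A minimal Python sketch of that arithmetic is given here; the function name `sim_from_counts` is illustrative and not part of the original work.

```python
def sim_from_counts(size_a, size_b, size_intersection):
    """Set similarity |A n B| / |A u B|, computed from set sizes alone."""
    size_union = size_a + size_b - size_intersection
    if size_union == 0:
        # The measure is not defined when both sets are empty.
        raise ValueError("similarity is not defined for two empty sets")
    return size_intersection / size_union


# German initial vs. medial syllables (sizes from Table 1, intersection from Table 2).
german_ini_med = sim_from_counts(3806, 3832, 2619)
# Dutch final syllables vs. monosyllables.
dutch_fin_mono = sim_from_counts(7279, 5641, 4320)
print(round(german_ini_med, 2), round(dutch_fin_mono, 2))
```

Rounded to two decimal places, the two values agree with the bracketed figures in Table 2 (0.52 and 0.50).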

Stress-correlated phonotactic variation.
Table 3 shows analogous statistics for phonotactic
variation correlated with word stress. Set sizes and
unique subset sizes are shown for the set of sylla- 
bles that carry primary stress (stressed), those im- 
mediately preceding stress (pretonic), those imme- 
diately following stress (posttonic), and all others 
(plain). In all three languages, the set of stressed 
syllables has least in common with other sets. In 
English, this is closely followed by the pretonic syl- 
lables. The average percentage of syllables unique 
to a set is highest in English, followed by Dutch and 
then German. 



These statistics show not only that there is signifi- 
cant syllable-level variation in the phonotactics of all 
three languages, but also that the simple strategy of 
subdividing the set of all syllables on the basis of po- 
sition and stress succeeds in capturing at least some 
of this variation. If a high percentage of syllables 
in one subcategory do not occur in any other, then 
distinguishing this syllable subcategory in a phono- 
tactic model will help reduce overgeneralisation. 
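
The "unique" columns of Tables 1 and 3 count, for each syllable set, the members that occur in no other set. The following Python sketch shows one way to compute such figures; the function name and the four tiny syllable sets are hypothetical placeholders, not CELEX data.

```python
def unique_counts(sets):
    """For each named set, return (size, number unique, percent unique)."""
    result = {}
    for name, members in sets.items():
        # Union of every other set; an element is unique if it is absent here.
        others = set().union(*(s for n, s in sets.items() if n != name))
        unique = members - others
        result[name] = (len(members), len(unique),
                        100 * len(unique) / len(members))
    return result


# Hypothetical toy syllable sets, for illustration only.
TOY_SETS = {
    "initial": {"ba", "stra", "ti"},
    "medial": {"ba", "ti"},
    "final": {"ti", "task"},
    "mono": {"task", "bat"},
}
for name, (size, n_unique, percent) in unique_counts(TOY_SETS).items():
    print(f"{name}: {size} {n_unique} ({percent:.2f}%)")
```

The TOTAL rows of the tables are consistent with the same definition: the total number of unique syllables is the sum of the per-set unique counts (for German in Table 1, 624 + 358 + 2,133 + 855 = 3,970).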

3 Encoding, Construction and 
Generalisation of Phonotactic 
Models 

3.1 Object-Based Finite-State Modelling 

The OFS Modelling formalism was used as a tool for 
encoding, constructing and generalising phonotactic 
models in the research described in Section 4. OFS 
Modelling consists of three main components, (i) a 
representation formalism, (ii) a mechanism for auto- 
matic model construction, and (iii) mechanisms for 
model generalisation. Brief summaries of the com- 
ponents that were used in the research described in 
this paper are given here (for full details see Belz, 
2000). 

Underlying OFS Modelling is a set of assumptions
about linguistic description that shares many
of the fundamental tenets of declarative phonology
(Bird, 1991, for example). This set of assumptions
includes a strictly non-derivational,
non-transformational and constraint-based approach to
linguistic description, and the principle of constraint
inviolability.

1 Set similarity here is the standard measure of the size of
the intersection over the size of the union of two sets $S_1$ and
$S_2$, or $|S_1 \cap S_2| / |S_1 \cup S_2|$ (not defined for $S_1 = S_2 = \emptyset$).

The OFS formalism is a declarative, monostratal
finite-state representation formalism that is intu- 
itively readable, facilitates the automatic data- 
driven construction of models, and permits the in- 
tegration of available prior, theoretical knowledge. 
The derivations (trees or brackettings) defined by 
OFS models correspond to context-free derivations 
with a limited tree depth or degree of nesting of 
brackets. This means that in OFS models (unlike 
in other normal forms for regular grammars), rules 
(hence expansions or brackets) can, if appropriately 
defined, systematically correspond to standard lin- 
guistic objects, the reason why the formalism is 
called object-based. 



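OFS Model $O = (N, T, P, n + 1)$

  n:    $O_0^n \Rightarrow \omega_0^n$
  n-1:  $O_1^{n-1} \Rightarrow \omega_1^{n-1}$, ...
  ...
  1:    $O_1^1 \Rightarrow \omega_1^1$, $O_2^1 \Rightarrow \omega_2^1$, ...
  0:    $O_1^0 \Rightarrow \omega_1^0$, $O_2^0 \Rightarrow \omega_2^0$, ...

Figure 1: Notational convention for OFS models.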

OFS Models. The OFS representation formalism
is essentially a normal form for regular sets. OFS
models can be interpreted in the same way as
standard production rule grammars, but are subject to
a set of additional constraints. An OFS model $O$ is
denoted $(N, T, P, n + 1)$, where $N$ is a finite set of
non-terminal objects $O_j^i$, $0 \le i \le n$, and $T$ is a
finite set of terminals. $P$ is an ordered finite set of
$n + 1$ sets of productions $O_j^i \Rightarrow \omega_j^i$, where
$O_j^i \in N$, and for $i > 0$, $\omega_j^i$ is a regular expression²
over symbols $O_h^g \in N$, $i > g$, whereas for $i = 0$,
$\omega_j^i$ is a set of strings³ from $T^*$. An OFS model $O$
has $n + 1$ levels, or sets of production rules, and each
rule $O_j^i \Rightarrow \omega_j^i$ is uniquely associated with one of
the levels. The set of production rules at level $n$ is a
singleton set $\{O_0^n \Rightarrow \omega_0^n\}$, and $O_0^n$ is
interpreted as the start symbol. The notational convention
adopted for OFS models is as shown in Figure 1.

2 In the regular expressions in this paper, $r^*$ denotes any
number of repetitions of $r$, $r^+$ denotes at least one repetition
of $r$, and $r + e$ denotes the disjunction of $r$ and $e$.

3 The string sets in level 0 RHSs are actually implemented
more efficiently as finite automata.



Definition 1 OFS Model
An OFS model $O$ is a 4-tuple $(N, T, P, n + 1)$,
where $N$ is a finite set of non-terminals $O_j^i$,
$0 \le i \le n$, $O_0^n \in N$ is the start symbol, $T$ is a finite
set of terminals, $n + 1$ denotes the number of
levels in the model, and

$P = \{\, \{O_0^n \Rightarrow \omega_0^n\},\ \ldots,\ \{O_1^1 \Rightarrow \omega_1^1,\ O_2^1 \Rightarrow \omega_2^1,\ \ldots\},\ \{O_1^0 \Rightarrow \omega_1^0,\ O_2^0 \Rightarrow \omega_2^0,\ \ldots\} \,\}$,

where each rule $O_j^i \Rightarrow \omega_j^i$ is uniquely associated
with one of the levels, $\omega_j^0$ is a set of strings
from $T^*$, and $\omega_j^i$, $i > 0$, is a regular expression over
objects $O_h^g \in N$, $i > g$.



Each rule $O \Rightarrow \omega$ in an OFS model corresponds to
a set of strings which will be referred to as an object 
set or class, where O is the name of the object. The 
production rules in OFS models will also be referred 
to as object rules. 

OFS models thus differ from standard production
rule grammars in three ways. Firstly, RHSs of rules
above level 0 are arbitrary regular expressions⁴.
Secondly, terminals from $T$ are restricted to appearing
in the RHSs of rules at level 0 (mostly to facilitate
automatic model construction, see below).
Thirdly, OFS models are limited in their representational
power to the finite-state domain by the constraint
that the RHSs of rules in rule sets at level
$i > 0$ are regular expressions over non-terminals that
appear only in the LHSs of rules in rule sets at levels
$g < i$. That this limits representational power
to the regular languages can be seen from the fact
that all non-terminals $O_j^i$ in the RHS of the single
top-level rule can be substituted iteratively with the
RHSs of the corresponding rules $O_j^i \Rightarrow \omega_j^i$. This
iteration terminates after a finite time because there
is a finite number of levels in the model, and at this
point the RHS of the top-level rule contains only
terminals, i.e. is a regular expression over $T$, hence
represents a regular language.
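
The substitution argument can be made concrete with a short Python sketch that expands the top-level rule into a single regular expression. This is an illustration under stated assumptions rather than the implementation used in the research: the function `flatten`, the token-list encoding of higher-level RHSs and the toy string sets are all hypothetical, and Python's `re` syntax writes disjunction as `|` where the paper's notation uses `+`.

```python
import re


def flatten(higher, level0, start):
    """Expand `start` into one regex by substituting rule RHSs for names."""
    def expand(name):
        if name in level0:
            # A level 0 RHS is a finite string set, i.e. a plain alternation.
            alternatives = sorted(level0[name], key=len, reverse=True)
            return "(?:" + "|".join(re.escape(s) for s in alternatives) + ")"
        # A higher-level RHS is a list of tokens: object names or regex
        # operators. Names refer to lower levels only, so recursion ends.
        parts = [expand(tok) if tok in level0 or tok in higher else tok
                 for tok in higher[name]]
        return "(?:" + "".join(parts) + ")"
    return re.compile(expand(start))


# Hypothetical two-level model in the style of the Syllable example.
HIGHER = {"Syllable": ["Onset", "Peak", "Coda"]}
LEVEL0 = {
    "Onset": {"", "b", "s", "str", "t"},
    "Peak": {"a", "i"},
    "Coda": {"", "t", "p", "sk"},
}
SYLLABLE = flatten(HIGHER, LEVEL0, "Syllable")
print(bool(SYLLABLE.fullmatch("strap")), bool(SYLLABLE.fullmatch("tsk")))
```

Because the flattened expression combines every onset with every peak and coda, it also accepts strings such as "sip" that need not occur in any data set, which is the kind of generalisation over attested forms discussed in this paper.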

Unlike other normal forms for regular production-rule
grammars (such as left-linear and right-linear
sets of production rules), OFS models enable the
definition of production rules and hence derivations that
can, if appropriately defined, correspond to standard
linguistic objects and constituents (not possible in
linear grammars). Through the association of rules
with a finite number of levels, OFS models permit the
definition of grammars that encode sets of context-free
derivations up to a maximum depth equal to the
number of levels in the model.

4 Other formalisms for linguistic analysis have permitted
full regular expressions in the RHSs of rules. For instance,
in syntactic grammars, the recursive nature of some types of
coordination has been modelled with right-recursive regular
expressions (e.g. in GPSG).

The fact that terminal strings are in OFS models
restricted to the lowest level facilitates the combined
theory- and data-driven construction of models.
Uninstantiated models can be defined that encode
what is known in advance about the structural
regularities of the object to be modelled in levels above 0,
and have under-specified level 0 RHSs that are
subsequently instantiated on the basis of data sets of
examples of the object to be modelled. OFS Modelling
also has a generalisation procedure which can
be used to generalise fully instantiated OFS models.
Each of these mechanisms is described in turn over
the following paragraphs.

Uninstantiated OFS Models. In fully specified
OFS models (as defined in the preceding section),
the right-hand sides (RHSs) of production rules at
level $i$ are regular expressions for $i > 0$, and string
sets for $i = 0$. This separation makes it simple to
construct incompletely specified models, or prototype
OFS models, where the RHSs of level 0 rules are
pattern descriptions rather than string sets. Level 0
rules in prototype models have the form $O_j^0 \Rightarrow S_j$,
where $O_j^0$ is the name of the object, and $S_j$ is a set
former $\{x : vxw \in D, P_1, P_2, \ldots, P_n\}$, where $v$, $w$ are
concatenations of variables, $D$ refers to any given
finite data set of strings, and $P_i$, $1 \le i \le n$, are
properties of the variables in $v$ and $w$.

Instantiation of Prototype OFS Models. The
OFS instantiation procedure takes a prototype OFS
model M for some linguistic object and a data set
D of example members of the corresponding object
class and proceeds as follows. For each level 0 rule
$O_j^0 \Rightarrow S_j$ in M, and for each element x of D, all
substrings of x that match $S_j$ are collected. The
resulting set of substrings becomes the new RHS of
rule $O_j^0$. After instantiation, level 0 rules whose RHS
is the empty set are removed, as are rules at higher
levels whose RHSs contain non-terminals that can no
longer be expanded by any of the production rules
in M.
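
A rough Python sketch of the instantiation step is given below, assuming that each level 0 pattern description can be approximated by a regular expression with one capture group for the substring x. The names `PROTOTYPE` and `instantiate`, the orthographic vowel inventory and the toy word list are hypothetical; the removal of higher-level rules that can no longer be expanded is not shown.

```python
import re

VOWELS = "aeiou"          # hypothetical inventory, orthographic for simplicity
C = f"[^{VOWELS}]"        # any non-vowel symbol counts as a consonant here
V = f"[{VOWELS}]"

# Each pattern stands in for a set former; group 1 captures the substring x.
PROTOTYPE = {
    "Onset": re.compile(f"^({C}*){V}"),          # x a y in D, x consonantal
    "Peak": re.compile(f"^{C}*({V}+){C}*$"),     # y x z in D, x vocalic
    "Coda": re.compile(f"{V}({C}*)$"),           # y a x in D, x consonantal
}


def instantiate(prototype, data):
    """Collect, for every level 0 rule, the matching substrings found in data."""
    model = {name: set() for name in prototype}
    for word in data:
        for name, pattern in prototype.items():
            match = pattern.search(word)
            if match:
                model[name].add(match.group(1))
    # Level 0 rules whose RHS stayed empty are removed.
    return {name: strings for name, strings in model.items() if strings}


TOY_DATA = ["at", "bat", "strap", "si", "task"]   # hypothetical data set
LEVEL0 = instantiate(PROTOTYPE, TOY_DATA)
for name, strings in LEVEL0.items():
    print(name, sorted(strings))
```

The empty string appears in the resulting Onset and Coda sets because some toy words begin or end with a vowel.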

Object-Set Generalisation. Instantiated OFS
models can be generalised by object-set (OS)
generalisation, where pairs of level 0 object sets are
compared on the basis of a standard set similarity
measure sim for two finite sets $D_1$ and $D_2$
(not defined for $D_1 = D_2 = \emptyset$):
$sim(D_1, D_2) = |D_1 \cap D_2| / |D_1 \cup D_2|$.
The OS-generalisation procedure
takes a fully specified OFS model M and a
given similarity threshold r, and, applying a simple
clustering algorithm, merges all object sets that
have a similarity value sim matching or exceeding
r. That is, the OS-generalisation procedure measures
the similarity between all pairs of level 0 sets,
and all pairs that match or exceed the threshold
end up in the same cluster. Finally, the old object
names (non-terminals) in the RHSs of object rules
at levels above 0 are replaced with the LHSs of the
corresponding new merged object rule, while all
object rules that now have identical RHSs are in turn
merged. In this way, generalisation 'percolates'
upwards through the levels of the model.
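
The procedure can be sketched in Python as follows. The source only speaks of "a simple clustering algorithm", so the transitive (single-link) merging used here is an assumption, as are the function names, the list encoding of higher-level RHSs and the toy sets; the subsequent merging of higher-level rules with identical RHSs is not shown.

```python
from itertools import combinations


def sim(a, b):
    """Set similarity; undefined (division by zero) for two empty sets."""
    return len(a & b) / len(a | b)


def os_generalise(level0, higher, threshold):
    """Merge level 0 object sets whose similarity reaches the threshold."""
    names = list(level0)
    parent = {name: name for name in names}

    def find(name):
        # Follow links to the representative of the cluster.
        while parent[name] != name:
            name = parent[name]
        return name

    for a, b in combinations(names, 2):
        if sim(level0[a], level0[b]) >= threshold:
            parent[find(b)] = find(a)

    clusters = {}
    for name in names:
        clusters.setdefault(find(name), []).append(name)

    merged, renaming = {}, {}
    for members in clusters.values():
        new_name = "_".join(members)
        merged[new_name] = set().union(*(level0[m] for m in members))
        for member in members:
            renaming[member] = new_name

    # Replace old object names in the RHSs of rules above level 0.
    new_higher = {lhs: [renaming.get(tok, tok) for tok in rhs]
                  for lhs, rhs in higher.items()}
    return merged, new_higher


# Hypothetical instantiated model.
LEVEL0 = {
    "Onset": {"", "b", "s", "str", "t"},
    "Peak": {"a", "i"},
    "Coda": {"", "t", "p", "sk"},
}
HIGHER = {"Syllable": ["Onset", "Peak", "Coda"]}
merged, new_higher = os_generalise(LEVEL0, HIGHER, 0.25)
print(sorted(merged), new_higher)
```

In this toy model Onset and Coda share two of seven distinct members (sim = 2/7), so they are merged for any threshold up to that value and kept apart above it.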

Determining an appropriate value for the similarity
threshold r is not unproblematic. It could
be set in relation to the average similarity value in
an instantiated model (individually for each prototype
instantiation), but this approach would obscure
the similarities that object-set generalisation (in
particular in conjunction with LIP) is intended to
exploit. The whole point of object-set generalisation
for language-independent prototypes is that it will
merge a different number of level 0 object classes in
different prototype instantiations, creating different
final, language-specific OFS models. If r is set in
proportion to the average similarity between level 0
classes, then this difference is reduced, and the
resulting models will tend to retain the same number
of level 0 object classes from the prototype. For
example, if the above prototype model Word is
instantiated to a data set from a language that has
phonotactics which differ only between stressed and
unstressed syllables, then all similarity values between
stressed syllable classes regardless of their position
within a word, and between all posttonic, pretonic
and plain syllable classes (again, regardless of
position), will be very high. The average similarity value
will therefore also be high. If r is set in relation to
this high average, not all unstressed and all stressed
syllable classes, respectively, will be merged, because
not all syllable classes can exceed average similarity.

Average similarity is a language-specific property, 
and so is the number of syllable classes similar 
enough to be merged for a given r value. For differ- 
ent generalised instantiations of the same prototype 
model to be comparable, object-set generalisation 
must have been carried out for each of them with 
the same r value. 

The threshold r is best regarded as a variable pa- 
rameter to the OS-generalisation procedure that can 
be used to control the degree to which a generalised 
OFS model will fit the data: the higher r, the more
closely the model will fit the data, and the less it will 
generalise over it. This is particularly appropriate in 
phonotactic modelling, because phonotactics seeks 
to encode not just the set of attested words, but also

Prototype OFS Model Syllable = ({Syllable, Onset, Peak, Coda}, T, P, 2)
1: Syllable => Onset Peak Coda
0: Onset => {x | xay ∈ D, x ∈ CONSONANTS*, a ∈ VOWELS}
   Peak  => {x | yxz ∈ D, x ∈ VOWELS+, y, z ∈ CONSONANTS*}
   Coda  => {x | yax ∈ D, x ∈ CONSONANTS*, a ∈ VOWELS}

Figure 2: Simple prototype OFS model for syllable-level phonotactics.

aez, as/, cnsk, assp, ass, ast, et, oik, o:ks, amts, o:, o:z, asks, ai, aiz, bei, ba:, ba:z, beib, bask, 
basks, si:, kasb, tfea*, tfead, smtf, smtft, kli:v, def, did, dju:st, dAvz, dra:fts, dweld, fai, fret, 
gauld, grDt, kwid, splast, sprirj, strasps, stAn 



Figure 3: Small data set of English monosyllabic words. 



OFS Model Syllable = ({Syllable, Onset, Peak, Coda}, T, P, 2)
1: Syllable => Onset Peak Coda
0: Onset => { ε, b, s, k, st, f, d, tf, kl, dj, dr, dw, fr, g, gr, kw, spl, spr, str }
   Peak  => { as, a:, e, o:, ai, ei, i:, ea, a, au, d, i, u: }
   Coda  => { ε, b, s, k, st, f, d, z, J, sk, sp, ks, nts, *, ntf, ntft, v, 1, vz,
              fts, ld, t, n, ps, n }

Figure 4: Syllable-level phonotactic OFS model instantiated with set of English monosyllables.



OFS Model Syllable = ({Syllable, Onset_Coda, Peak}, T, P, 2)
1: Syllable => Onset_Coda Peak Onset_Coda
0: Onset_Coda => { ε, b, s, k, st, f, d, tf, kl, dj, dr, dw, fr, g, gr, kw, spl, spr, str,
                   z, J, sk, sp, ks, nts, *, ntf, ntft, v, 1, vz, fts, ld, t, n, ps, n }
   Peak       => { as, a:, e, o:, ai, ei, i:, ea, a, au, d, i, u: }

Figure 5: OFS model of Figure 4 generalised with r < 0.19.



unattested, but wellformed words (often called 'ac- 
cidental' gaps), while excluding only illformed words 
(or 'systematic' gaps). There is no objective divid- 
ing line between idiosyncratic and systematic gaps, 
and setting r can be used as one way of controlling 
the degree of conservativeness in generalising over 
the set of attested words. 

3.2 Example 

As an illustration, consider the following example
construction of a simple OFS model for syllable-level
phonotactics (the constraints that hold on the possible
phoneme sequences within syllables)⁵. The prototype
OFS model constructed in the first step (Figure 2)
encodes the standard assumption that the
syllable-level phonotactics in different languages can
be appropriately modelled by interpreting syllables
as a sequence of consonantal phonemes (onset),
followed by a sequence of vocalic phonemes (peak), and
another sequence of consonantal phonemes (coda).
In the second construction step, a data set of English
monosyllabic words (Figure 3) is used to instantiate
the prototype OFS model. The instantiation
procedure constructs an OFS model with new
level 0 RHSs as shown in Figure 4. During
OS-generalisation, sim values are computed for each
pair of level 0 object sets. The only pairwise
intersection that is non-empty (hence the only non-zero
sim value) in this example is that between the sets
Coda and Onset (sim = 0.19), which are merged
if OS-generalisation is applied to OFS model Syllable
with r < 0.19, resulting in the simpler, more general
OFS model shown in Figure 5.

5 The example model is not intended to be a realistic
phonotactic model, but is provided here merely as an
illustration of the techniques outlined above.
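
The similarity value quoted for Onset and Coda can be checked by hand against Figure 4. The counts used here are a reading of the figure as printed: 19 onsets, 25 codas, and 7 shared members (the empty string, b, s, k, st, f, d).

```latex
\[
\mathit{sim}(\mathit{Onset}, \mathit{Coda})
  = \frac{|\mathit{Onset} \cap \mathit{Coda}|}{|\mathit{Onset} \cup \mathit{Coda}|}
  = \frac{7}{19 + 25 - 7}
  = \frac{7}{37}
  \approx 0.19
\]
```

The union size of 37 matches the number of members listed for the merged object set in Figure 5.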

3.3 Language-Independent Prototyping 

Language-independent prototyping (LIP) as a general
approach to linguistic description seeks to define
generic models that restrict, in some linguistically
meaningful way, the set of grammars or
descriptions that can be inferred from data. OFS
modelling can be used as an implementational tool
for LIP. Language-independent prototype OFS models
can be defined by specifying a maximal number
of objects and corresponding production rules such
that when the prototype is instantiated and generalised
with data sets from individual languages, different
object sets will be deleted and merged for different
languages, resulting in different final, instantiated
and generalised OFS models. In the following
section, a language-independent phonotactic prototype
OFS model is instantiated to surprisingly different
OFS models for three closely related languages.

Figure 6: Prototype OFS model for multi-syllable word-level phonotactics.

4 Multi-Syllable Phonotactic Models
for German, English and Dutch 

When applied to modelling multi-syllable word-level 
phonotactics, LIP with OFS Modelling means defin- 
ing the maximum possible number of syllable classes 
that may be subject to different phonotactic con- 
straints in a given group of languages. The exact 
set of syllable classes depends on the group of lan- 
guages the prototype is intended to cover as well as 
the desired amount of generalisation over data (in 
general, a model that distinguishes only two syllable 
classes will generalise more than a model that distin- 
guishes three or more classes, given the same data). 
The prototype presented in this section is intended 
to cover German, English and Dutch, and takes into 
account only phonological factors (syntactic factors 
such as word category which can also affect phono- 
tactics are not taken into account). Two phonologi- 
cal factors are modelled: position of a syllable within 
a word, and position of a syllable relative to primary 
word stress. 

For this modelling task, the LIP approach is
implemented by constructing an OFS prototype model
in which syllable classes reflecting all possible
different combinations of position within a word and
relative to stress are defined as level 0 uninstantiated
object rules, and all possible ways in which the
corresponding objects can be combined to form words
are defined as higher-level object rules. No prior
assumptions about where phonotactic variation occurs
are hardwired into the model. Instead, the maximal
ways in which phonotactics may vary in a group of
languages are encoded. The idea is that prototype
instantiation and OS-generalisation with data sets of
phonological words from different languages will
result in different final, instantiated phonotactic
models.

4.1 Language-Independent Prototype OFS 
Model for Multi-syllable Phonotactics 

The prototype model shown in Figure ^ distin- 
guishes between twelve syllable classes which cor- 
respond to all possible combinations of position 
within a word and position relative to primary stress 
(' marks primary stress, — is the syllable separator, 
and S = syllable). As before, the set of all syllables
is divided into four classes on the basis of position
(mon = monosyllabic, ini = initial, med = medial,
fin = final), each of which is divided further into four
subclasses on the basis of stress (st = stressed,
pr = pretonic, po = posttonic, pl = plain). This results
in a total of 12 possible syllable categories (see
footnote 6). D is the data set given in instantiation,
and M the corresponding set of terminals (here, the
phonemic symbols that occur in D). The RHS of the
level 1 object rule encodes all possible ways in which
the twelve syllable classes can theoretically combine
to form words. The prototype model is language-independent,
because not all syllable classes will exist in all
languages (e.g. a language where primary stress is always
on the first syllable would not have classes of
word-initial pretonic or plain syllables), and
OS-generalisation will create different new syllable
classes, depending on which classes are most similar in a
given language.

Table 4: Sizes of level 0 object sets resulting from instantiations, and syllables unique to each set.
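The count of twelve classes can be checked mechanically. The following Python sketch is not part of the OFS formalism; it only enumerates the position and stress combinations and removes those assumed to be impossible. Only the exclusion of posttonic initial syllables is stated explicitly (in footnote 6). The other three exclusions are an assumption inferred from the class names used elsewhere in the text, and the names POSITIONS, STRESS and IMPOSSIBLE are hypothetical.

```python
from itertools import product

POSITIONS = ["monosyllabic", "initial", "medial", "final"]
STRESS = ["stressed", "pretonic", "posttonic", "plain"]

# Combinations assumed not to exist. Only ("initial", "posttonic") is
# stated in the paper; the remaining three are an illustrative assumption:
# a monosyllable has no neighbouring syllable to be pretonic or posttonic
# to, and a final syllable has nothing after it to be pretonic to.
IMPOSSIBLE = {
    ("monosyllabic", "pretonic"),
    ("monosyllabic", "posttonic"),
    ("initial", "posttonic"),
    ("final", "pretonic"),
}


def syllable_classes():
    """Return all (position, stress) pairs that are not excluded."""
    return [
        (position, stress)
        for position, stress in product(POSITIONS, STRESS)
        if (position, stress) not in IMPOSSIBLE
    ]


classes = syllable_classes()
print(len(classes))  # 16 combinations minus 4 exclusions
```

Under these assumptions the enumeration yields the twelve classes of the prototype rather than 4 x 4 = 16.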

4.2 Prototype Model Instantiations 

Table 4 shows the sizes of the different level 0 object
sets resulting from OFS model instantiations to the
German, English and Dutch word sets derived from
CELEX (the syllable sets are far too large to be shown
in their entirety). In all three languages, the largest
syllable set is the set of stressed monosyllables, and
the smallest is the set of medial pretonic syllables
(see footnote 7). Table 4 also shows (in the same format
as in Section 2) the number of syllables in each syllable
class that do not occur in any of the other classes.

In German and Dutch, percentages of unique syllables
are significantly lower than in the classes reflecting
position only and stress only that were shown in
Section 2, indicating that some of the classes may not
be worth distinguishing in phonotactic models. In English,
however, the higher percentages of unique syllables are
not far behind those shown previously, indicating that
most of the twelve syllable classes in the prototype are
worth distinguishing.

6 Not 4 x 4 = 16 classes, because some classes cannot exist
(e.g. there is no such thing as a posttonic initial syllable).

7 Disregarding the set of plain monosyllables, of which there
were no examples in the Dutch section of CELEX, and only a
very small number in the English section.

Some correlation is evident between the size of a 
set and the percentage of unique syllables it contains. 
In German, average syllable set size is 2,754 and the 
average percentage of unique syllables is 6.48%. Five 
syllable sets are of above average size, and four of 
these also have above-average percentages of unique 
syllables. Seven syllable sets are below average in
size, and none of these have above-average percentages
of unique syllables. In English, the picture is
not as straightforward. Average syllable set size is 
2,717, and average percentage of unique syllables is 
18.62%. Of the four sets of above-average size, two 
have above-average, and two have below-average, 
percentages of unique syllables. Of the seven En- 
glish syllable sets of below-average size (the set of 
plain monosyllables is disregarded again for English 
and Dutch), two have above-average, and five have 
below-average percentages of unique syllables. Fi- 
nally, in Dutch, average set size is 3,449 and aver- 
age percentage of unique syllables is 6.33%. Four 
of the six above-average sized sets also have above- 
average percentages of unique syllables, while all of 
the below-average sized sets also have below-average 
percentages of unique syllables. However, there is no 
complete correlation, with some of the largest sets 
having very small percentages of unique syllables, 
and vice versa. 
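The comparison of set size with percentage of unique syllables can be reproduced for any collection of syllable sets. The sketch below assumes that a syllable is unique to a class if it occurs in no other class, and that the average percentage is the unweighted mean over classes. The function names and the toy sets are hypothetical and are not CELEX data.

```python
def unique_stats(class_sets):
    """Map each class name to (set size, unique count, unique percentage)."""
    stats = {}
    for name, syllables in class_sets.items():
        # Union of every class except the current one.
        others = set().union(
            *(s for other, s in class_sets.items() if other != name)
        )
        unique = syllables - others
        pct = 100.0 * len(unique) / len(syllables) if syllables else 0.0
        stats[name] = (len(syllables), len(unique), pct)
    return stats


def size_vs_uniqueness(stats):
    """Flag each class as above average in size and in unique percentage."""
    mean_size = sum(size for size, _, _ in stats.values()) / len(stats)
    mean_pct = sum(pct for _, _, pct in stats.values()) / len(stats)
    return {
        name: (size > mean_size, pct > mean_pct)
        for name, (size, _, pct) in stats.items()
    }


# Invented toy data, for illustration only.
toy_sets = {
    "mon_st": {"ta", "ka", "strump", "blik"},
    "ini_st": {"ta", "ka", "mi"},
    "fin_po": {"ta", "ka"},
}
stats = unique_stats(toy_sets)
flags = size_vs_uniqueness(stats)
for name in toy_sets:
    print(name, stats[name], flags[name])
```

In the toy data the largest set is also the one with the highest share of unique syllables, which is the pattern reported for most German and Dutch sets.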

4.3 OS-Generalisation of Models 

As is clear from the instantiation results presented in
the preceding section, some syllable classes contain
such low percentages of unique syllables that they are
not worth distinguishing as separate classes.
OS-generalisation of models can be used to merge
the most similar classes and reduce the number of
classes that the model distinguishes.

Figure 7: Cluster tree for German syllable sets.

Figure 8: Cluster tree for Dutch syllable sets.

4.3.1 Generalisation of Multi-Syllable OFS
Model for German

Figure 7 shows the cluster tree for the German syllable
sets produced by carrying out OS-generalisation for
r = 0.1, ..., 1.0 in increments of 0.1. Each node in
the tree shows at which r values the original syllable
sets at the leaves dominated by the node were merged.
The tree reveals a very neat picture for German. The
highest r value between any syllable class pair is 0.56,
so for r > 0.6 no classes are merged. r = 0.5 results in
two clusters, one containing final unstressed syllables,
the other initial and medial unstressed syllables. At
r = 0.4, all monosyllables are added to the final syllable
class, and one more medial and one more initial class to
the set of initial and medial syllables. At r = 0.3, all
monosyllables and final syllables on the one hand, and all
initial and medial syllables on the other, are merged.
Setting r lower makes no difference until it is set below
0.2, at which point all of the original syllable classes
are merged into a single set.
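A threshold sweep of this kind can be sketched in Python as follows. The similarity measure and merging procedure of OS-generalisation are defined earlier in the paper and are not reproduced here; the sketch substitutes Jaccard similarity and single-link merging as stand-ins, which is an assumption and not the paper's definition. The function names and toy syllable sets are likewise invented for illustration.

```python
def jaccard(a, b):
    """Stand-in similarity: shared syllables over all syllables."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def clusters_at(class_sets, threshold):
    """Group classes whose pairwise similarity reaches the threshold."""
    names = list(class_sets)
    parent = {name: name for name in names}

    def find(name):
        # Follow parent links to the cluster representative.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if jaccard(class_sets[a], class_sets[b]) >= threshold:
                parent[find(a)] = find(b)

    groups = {}
    for name in names:
        groups.setdefault(find(name), set()).add(name)
    return sorted(sorted(group) for group in groups.values())


def sweep(class_sets):
    """Clusters for r = 1.0 down to 0.1 in steps of 0.1."""
    return {k / 10: clusters_at(class_sets, k / 10) for k in range(10, 0, -1)}


# Invented toy data, for illustration only.
toy_sets = {
    "fin_po": {"a", "b", "c", "d"},
    "fin_pl": {"a", "b", "c", "e"},  # similarity 3/5 with fin_po
    "ini_pr": {"x", "y", "z"},
    "ini_pl": {"x", "y", "w"},       # similarity 2/4 with ini_pr
}
tree = sweep(toy_sets)
for r, clusters in tree.items():
    print(r, clusters)
```

Reading the result from high r to low r gives the levels of a cluster tree: each r value at which the grouping changes corresponds to a node.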

This shows clearly that in German the distinction
between monosyllables and final syllables on the one
hand, and initial and medial syllables on the other, is
very strongly marked (preserved even when r is set as
low as 0.2). This distinction is thus marked far more
strongly than the unstressed/stressed division (which is
more commonly encoded in DFA models of German
phonotactics), which disappears at r = 0.4 (in fact,
even earlier, at r = 0.47).

Figure 9: Cluster tree for English syllable sets.



4.3.2 Generalisation of Multi-Syllable OFS
Model for Dutch

The cluster tree for Dutch (Figure 8) also reveals an 
important division between final and monosyllables 
on the one hand, and initial and medial syllables 
on the other. However, it is not as clearly marked 
as in German. There is a point (r = 0.4) when 
all final and monosyllables are in the same cluster, 
but this is not the case for the initial and medial 
syllables, which form subclusters that are correlated 
with stress. The medial plain and posttonic syllable 
sets are merged with each other at r = 0.6, and with 
the initial stressed and medial stressed syllables at 
r = 0.4. But there is no greater similarity between 
this cluster and the cluster of initial pretonic and 
plain syllables (formed at r = 0.4) than there is 
between it and the cluster of final and monosyllables. 
All three are merged into a single cluster at r = 0.3. 

4.3.3 Generalisation of Multi-Syllable OFS
Model for English

In the cluster tree for English (Figure 9), there are 
clusters clearly correlated with stress and clusters 
clearly correlated with position. At r = 0.3 three 
clusters are formed, one containing all medial sylla- 
ble sets except the stressed medial syllables, another 
containing all final syllable sets except the stressed 
final syllables, and the third containing two stressed 
syllable sets. At r = 0.25, all stressed syllables 
together form one cluster. However, at r = 0.2, 
two unstressed syllable sets are added to this clus- 
ter, while all the remaining unstressed sets form the 
other large cluster. Thus, in English, both stress and 
position are strong determinants of phonotactic vari- 
ation, but differences resulting from stress are more 
pronounced than those resulting from position. 



4.4 Discussion 

The LIP approach implemented with OFS Modelling
proceeds in three steps. First, the factors likely to
produce phonotactic idiosyncrasy (stress and position
in the above examples), and the constituents to be used
in the analysis (syllables only in the above examples),
are decided, and a prototype model is constructed on
this basis. This prototype distinguishes as many objects
at level 0 as there are possible combinations of factors
and lowest-level constituents. All ways in which these
objects can combine to form higher-level constituents are
encoded at the corresponding higher levels in the model.
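For the stress and position example, the combinations encoded at the higher level can be illustrated with a small Python sketch. It assumes that a pretonic syllable immediately precedes, and a posttonic syllable immediately follows, the syllable carrying primary stress, and that every other unstressed syllable is plain. This reading matches the two bisyllabic patterns used for model (3) in the discussion, but it is an assumption, and the function name and the abbreviation pl for plain are hypothetical.

```python
def class_sequence(n_syllables, stressed=None):
    """Class labels for a word, given the 0-based index of primary stress.

    `stressed=None` is accepted only for monosyllables and yields the
    plain monosyllable class (an illustrative convention).
    """
    if n_syllables == 1:
        return ["mon_st" if stressed == 0 else "mon_pl"]
    if stressed is None or not 0 <= stressed < n_syllables:
        raise ValueError("polysyllabic words need an in-range stress index")

    labels = []
    for i in range(n_syllables):
        if i == 0:
            position = "ini"
        elif i == n_syllables - 1:
            position = "fin"
        else:
            position = "med"
        if i == stressed:
            stress = "st"
        elif i == stressed - 1:
            stress = "pr"   # immediately before primary stress (assumed)
        elif i == stressed + 1:
            stress = "po"   # immediately after primary stress (assumed)
        else:
            stress = "pl"   # plain: any other unstressed syllable
        labels.append(f"{position}_{stress}")
    return labels


bisyllabic = [class_sequence(2, k) for k in range(2)]
print(bisyllabic)

# Collect every class that occurs in words of up to four syllables.
reached = {"mon_st", "mon_pl"}
for n in range(2, 5):
    for k in range(n):
        reached.update(class_sequence(n, k))
print(sorted(reached))
```

Under these assumptions, words of up to four syllables already make use of all twelve classes of the prototype.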

In the second step, the prototype is instantiated
with data sets from different languages. The degree
to which the instantiated models generalise over
the given data is determined by the number of
constituents and subcategories of constituents
distinguished in the prototype. As an example,
consider the different degrees to which three models
that discriminate different numbers of syllable
classes generalise over given data. All three models
define words as sequences of syllables, and syllables
as sequences of phonemes. The first model has
only one syllable class, the second distinguishes
four classes reflecting position in a word, and the
third is the same as the model presented in the
preceding section, i.e. it distinguishes twelve syllable
classes. After instantiation with the same data set
of German phonological word forms from CELEX
used previously, the three models will encode
supersets of the data set that generalise over it
to different degrees. Looking at subsets of words
of the same length gives some impression of the
differences. For instance, model 1 encodes 10,598
monosyllabic German words (the total number of
different syllables in the data), whereas models 2
and 3 encode only 6,841 monosyllables (the actual
number of monosyllabic words in CELEX). The
following table shows the number of bisyllabic words
each model encodes.



Model                                   Bisyllabic words

(1) Syll Syll                           1.12 x 10^8
(2) Syll_ini Syll_fin                   2.67 x 10^7
(3) (Syll_ini_pr Syll_fin_st) +
    (Syll_ini_st Syll_fin_po)           1.89 x 10^7

Attested forms                          7.09 x 10^4



Model 3 permits about 266 times as many bisyl- 
labic word forms as there are in CELEX, model 2 en- 
codes 1.4 times as many as model 3, and model 1 en- 
codes 4.2 times as many as model 2. Thus, through 
progressively finer-grained subcategories of syllables, 
progressively closer approximations of the set of at- 
tested forms can be achieved. 
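The figures above can be checked with a few lines of Python. The sketch assumes that the number of words a model encodes is the sum, over its syllable-class patterns, of the product of the class sizes, which holds when different syllable sequences always give different words. Only the total of 10,598 syllables and the rounded table values are taken from the text; the individual class sizes behind models 2 and 3 are not given here, so the ratios are computed from the rounded values, and the helper name encoded_words is hypothetical.

```python
from math import prod


def encoded_words(patterns, class_sizes):
    """Sum over patterns of the product of the sizes of their classes."""
    return sum(prod(class_sizes[c] for c in pattern) for pattern in patterns)


# Model 1: any syllable followed by any syllable.
n_syllables = 10_598  # distinct syllables in the German data (from the text)
model1 = encoded_words([("Syll", "Syll")], {"Syll": n_syllables})
print(model1)  # about 1.12 x 10^8

# Rounded values from the table.
reported = {
    "model 1": 1.12e8,
    "model 2": 2.67e7,
    "model 3": 1.89e7,
    "attested": 7.09e4,
}
ratio_3_to_attested = reported["model 3"] / reported["attested"]
ratio_2_to_3 = reported["model 2"] / reported["model 3"]
ratio_1_to_2 = reported["model 1"] / reported["model 2"]
print(ratio_3_to_attested, ratio_2_to_3, ratio_1_to_2)
```

The squared syllable count reproduces the table entry for model 1, and the ratios agree with the stated factors of about 266, 1.4 and 4.2 to within the rounding of the table values.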

However, doing this in an indiscriminate, 
language-independent way may produce some syl- 
lable classes that are very similar. With OS- 
generalisation, the most similar classes can be 
merged, so that only strongly marked differences are 
preserved. However, setting r to any specific value 
is problematic. Producing cluster trees with a range 
of r values can give some idea of important class dis- 
tinctions, and can be used as a basis for determining 
an appropriate r value. The choice of r can further be
motivated by different linguistic assumptions and the
intended purpose of the generalised models. Generalising
different instantiations of the same prototype for the
same r value makes it possible to compare the relative
markedness of phonotactic variation in different
languages.

5 Summary and Further Research 

This paper described how OFS modelling and 
the multi-syllable approach can be combined 
with language-independent prototyping to create a 
method for designing phonotactic models that (i) fa- 
cilitates automatic model construction, (ii) produces 
models that are arbitrarily close approximations of 
the set of well-formed phonological words in a given 
language, and (iii) provides a generalisation method 
with control over the degree to which final models 
fit given data. Extensions of the approach currently 
under investigation include stochastic OFS models, 
and the integration of OFS models into finite-state 
syntactic grammars. 